README

This repository contains the datasets and code for replicating all analyses and visualizations in the publication. Note that the analysis was originally carried out differently, using the Perl API of the IMS Open Corpus Workbench (https://cwb.sourceforge.io/). However, the scripts and datasets in this repository produce the same results.

In detail, the repository contains:

1) Frequency lists of the corpus for three positional attributes (words, parts of speech, lemmas) and a frequency list of lemma trigrams, each broken down by rating. The frequency lists were created with the tool cwb-scan-corpus using the following commands:

$ cwb-scan-corpus EVENTIM lemma+0 text_rating > lemma.csv
$ cwb-scan-corpus EVENTIM word+0 text_rating > word.csv 
$ cwb-scan-corpus EVENTIM pos+0 text_rating > pos.csv 
$ cwb-scan-corpus EVENTIM lemma+0 lemma+1 lemma+2 text_rating > trigram.csv

2) A Jupyter notebook for keyword and key-ngram calculation and for calculation of new frequency lists.

3) A table artists.csv containing all the reviewed artists with the corresponding number of texts.

4) A table textlength.csv containing the number of tokens and the rating of every text in the corpus. The mean values can be computed with the R script textlength.R.

5) A table tokens_per_rating.csv showing the subcorpus sizes.

6) A series of distribution tables containing the (relative) frequencies of selected lemmas and other linguistic phenomena broken down by rating.

7) The distribution graphs as PNG files, as shown in the publication. They can be produced with the R script distributions.R.
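
As a rough illustration of how the frequency lists (item 1) and the subcorpus sizes (item 5) relate to the relative frequencies in the distribution tables (item 6), here is a minimal Python sketch. It is not the code from the notebook. It assumes that each line of a frequency list is tab-separated with the frequency in the first column, followed by the item and the rating, and that subcorpus sizes are available as a mapping from rating to token count. Please check both assumptions against the actual files. The function names read_scan_output and relative_frequencies and the sample data are made up for this example.

```python
import io
from collections import defaultdict


def read_scan_output(handle):
    """Collect counts[item][rating] from tab-separated lines.

    Assumed line layout (please verify against the real files):
    frequency <TAB> item <TAB> rating
    """
    counts = defaultdict(dict)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        freq, item, rating = fields
        counts[item][rating] = counts[item].get(rating, 0) + int(freq)
    return counts


def relative_frequencies(counts, sizes, per=1_000_000):
    """Convert absolute counts to frequencies per `per` tokens.

    `sizes` maps each rating to the token count of its subcorpus.
    Ratings in which an item never occurs get 0.0.
    """
    result = {}
    for item, by_rating in counts.items():
        result[item] = {
            rating: by_rating.get(rating, 0) * per / size
            for rating, size in sizes.items()
        }
    return result


# Made-up sample data, not taken from the corpus.
sample = "3\tgut\t5\n1\tgut\t1\n2\tlaut\t5\n"
sizes = {"1": 1000, "5": 2000}

counts = read_scan_output(io.StringIO(sample))
rel = relative_frequencies(counts, sizes)
for item in sorted(rel):
    print(item, rel[item])
```

To apply this to the real data, one would pass an open file handle (for example for lemma.csv) instead of the io.StringIO sample and fill sizes from tokens_per_rating.csv, whose column layout is not described here and therefore has to be inspected first.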

If you want to query the full texts, you can request corpus access at https://fussballlinguistik.de/korpora/registrierung/. Please make a note in the registration form that you would like to query the Fan Reports corpus.